Google Data Analytics Capstone Project

And so begins my first ever analysis. Using the skills I’ve learnt throughout the course I aim to showcase them through this project, following the data analysis process of : Ask, Prepare, Process, Analyse, Share, and Act.
Initially when beginning the project I had the aim to use the tools I learnt during this course which would be:
- Use Google sheets to do initial cleaning.
- Use SQL - BigQuery for sorting & analysis.
- Finish off on R programming/Tableau for further analysis, visualisation and reporting.

Unfortunately due to the size of the data sets I was working on, Google sheets struggled & so did excel when I attempted there. I knew I would be limited and would have to improvise with BigQuery, but as I tried to upload a single uncleaned file I got over 200 errors… Thus I ended up using Microsoft’s Power Query to do the cleaning phase required for this project. As it was something I was planning of learning next, using power query was the saving grace that I will detail below.
Time is hard to come by, I’ve included a summary of this project on my github in the README apposed to this full review.

Three people who I’d like to mention are:
Equitable Equations: whose video’s I started to watch alongside this course’s module on R programming. His content has helped me alot during this project, as I was supplementing my learning from the course with his content.
Kevin Stratvert: apart from Microsofts own power query tutorials I referred to his video’s in learning power query. Additionally throughout the course when learning about spreadsheets, I watched his video’s to see the microsoft way of doing the steps shown on google sheets.
Alex The Analyst: With little knowledge on uploading of projects online I looked at his video’s on how to build the website via github, he has project walkthrough’s which I might attempt at a later point, especially the SQL ones.

Scenario

I’ve been hired as a junior data analyst working on the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director believes the future success of the company depends on the growth of annual membership. So looking at the data from the last 12 months I hope to understand how the companies casual and annual members differ. From there I am to make 3 suggesting with my findings for new marketing strategies that could convert current casual membership to annual membership.

About the company

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime. Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Ask

  1. How do annual members and casual riders use Cyclistic bikes differently?
  2. Why would casual riders buy Cyclistic annual memberships?
  3. How can Cyclistic use digital media to influence casual riders to become members?

Prepare

The company has provided historical trip data in the form of csv files that are divided by month to analyse.
As Cyclistic is a fictional company, I’ll be using data provided by Motivate International Inc under this licence.
A concept that was taught during the course was “ROCCC”: Reliable, Original, Comprehensive, Current, Citied.

Looking at the individual months of data provided: I have 13 columns:
ride_id, rideable_type, started_at, ended_at, start_station_name, start_station_id, end_station_name, end_station_id, start_lat, start_lng, end_lat, end_lng & member_casual

The Game plan

  • I’ve identified that ride_id is the primary key which would have been helpful in SQL, started_at & ended_at will provide me dates, time, month & duration.
  • start_lat, start_lng, end_lat & end_lng will provide me with distance. Member_casual will be important for comparison and whole counts.
  • While start_station_name, start_station_id, end_station_name & end_station_id will provide me with filtering and recommendations.

    Considering the scenario, I believe that this information does “ROCCC”, but along the way I will realise that the Comprehensive part falls short.

Process

Complication

As I eluded to before, when working with these files on Google sheets I kept running into numerous issues. The documents kept crashing, attempting to create new columns with filtered data or calculations kept bringing up error’s & very slow updates even with 32GB ram.
I spent a glorious 2 days trying to make it work, but with these datasets having over 600,000 rows seemed too much for both Google sheets and excel.
That’s when I considered doing the filtering on Google BigQuery, I would be limited as the free version does not support deleting but it would just mean some of my filtering would require additional lines of code to exclude values not required. As I go to upload the first file, I get 203 individual errors with the csv I try to upload. Not ideal!

Resolution

After abit of research online, alot of other people attempting this capstone pointed out they were running into similar issues. Noting that Power Query is helpful work around.
After learning the platform and with some trial and error I was able to clean the data on Power Query following these steps:

  1. Filter Blanks: Remove blanks from individual columns.
  2. Extract Dates: Use Date.From([–]) to extract dates from started_at and ended_at.
  3. Match Dates: Create a new column to check if dates match:
    if [column1] = [column2] then “match” else “no match”
    Filter out non-matching dates and delete the column.
  4. Extract Times: Use Time.From([–]) to extract times from started_at and ended_at.
  5. Calculate Duration: Create new column duration by using [end time] - [start time] columns.
  6. Day of the Week: Use Date.ToText([], "dddd") to get the day of the week.
  7. Year and Month: Use Date.ToText([], "YY-mm") to extract year and month.
  8. Hour with AM/PM: Use Time.ToText([], "h tt") to get the hour with AM/PM.
  9. Calculate Distance: Using this formula to calculate distance using starting and ending lat/long.
  10. filtering coordinates to Chicago’s location:
    • Latitude: start_lat and end_lat between 41.6445 and 42.0230.
    • Longitude: start_lng and end_lng between -87.9401 and -87.5237.


After completing the filtering process, I transferred the newly filtered data to individual sheets that were organized by month for a quick review. The original sheets named in the format 202309-divvy-tripdata, were saved separately with new file names corresponding to the original titles, such as September-2023. I created a couple of pivot tables for selected months so I can get some initial interpretations on what the results look like.
alt text alt text
Using the pivot tables I was able to mock up some visualisations on Excel that give me some initial understanding on the what the data story would look like. It was actually at this point I had some questions on the data, that would lead to my assumption on the Comprehensive flaw. I’ve left a clue in the two examples above.

Analyse

Orignally the plan was to use my newly filtered data to upload to R programming for analysis, but due to my concern identified in the Process phase I decided to re do it all on R. This would give me the chance to work with R, ensure my work is bulletproof and have some fun with it.

setwd("path/to/your/directory") all_files <- list.files(pattern = "*.csv") files <- all_files[all_files %in% c("202309-divvy-tripdata.csv", "202310-divvy-tripdata.csv", "202311-divvy-tripdata.csv", "202312-divvy-tripdata.csv", "202401-divvy-tripdata.csv", "202402-divvy-tripdata.csv", "202403-divvy-tripdata.csv", "202404-divvy-tripdata.csv", "202405-divvy-tripdata.csv", "202406-divvy-tripdata.csv", "202407-divvy-tripdata.csv", "202408-divvy-tripdata.csv")] Cyclistic_data <- files %>% lapply(read_csv) %>% bind_rows() rm(list = setdiff(ls(), "Cyclistic_data")) #starts with 5,699,639 observation. #Creating an area of which the chicago resides in. lat_min <- 41.6445 lat_max <- 42.0230 lng_min <- -87.9401 lng_max <- -87.5237 #A filter clause that will remove any rows that dont reside in the chicago area. Cyclistic_data <- Cyclistic_data %>% filter( start_lat >= lat_min & start_lat <= lat_max & end_lat >= lat_min & end_lat <= lat_max & start_lng >= lng_min & start_lng <= lng_max & end_lng >= lng_min & end_lng <= lng_max) #5,639,036 observations #Clearing dates were the ended date is before starting. Cyclistic_data <- Cyclistic_data %>% filter( ended_at > started_at & as.Date(started_at) == as.Date(ended_at)) # 5,615,553 observations, having a check for column status. str(Cyclistic_data) #Another thing I noticed in excel was some of the start/end station names had some * at the end of some of the names. This will interfer with grouping later! #So this function I'm not too experienced with, but what it is doing as I've interpreted is clearing * from the column, it replaces it with the "" which means nothing as this is a mutate function. #but the & to the right inclosed by another ] is keeping "&" in the string which are in station names. Cyclistic_data <- Cyclistic_data %>% mutate(started_at = gsub("\\*", "", started_at)) %>% mutate(ended_at = gsub("\\*", "", ended_at)) #the previous step changes the columns to chr , need to change them back. Cyclistic_data <- Cyclistic_data %>% mutate(started_at = as.POSIXct(started_at, format = "%Y-%m-%d %H:%M:%S")) %>% mutate(ended_at = as.POSIXct(ended_at, format = "%Y-%m-%d %H:%M:%S")) #Another mutate function that grabs the end time from the column and calculates the time different since start time. I had to search R for this one too. Cyclistic_data <- Cyclistic_data %>% mutate(duration = as.numeric(difftime(ended_at, started_at, units = "mins"))) #When cleaning in excel I noticed some blanks in columns for started at, ended at(small amount around 200) & alot in station name's(over a million).I have some rows in month of September of 2023 that ill remove soon Cyclistic_data <- na.omit(Cyclistic_data) # 4,149,404 obversations are left #Checking to see if done correctly. str(Cyclistic_data$duration) summary(Cyclistic_data) #This is an amazing package, With provided start/end latititude and longitude it can calculate the distance travelled! Cyclistic_data <- Cyclistic_data %>% rowwise() %>% mutate(distance = distHaversine(c(start_lng, start_lat), c(end_lng, end_lat))) #Previously made the result in meters, had to convert to km. Cyclistic_data <- Cyclistic_data %>% mutate(distance = distance / 1000) #This is doing two things, creating a column for date exclusively from the started_at column, then using that newly created column to figure out the day of the week and create a column for that also. Cyclistic_data <- Cyclistic_data %>% mutate(date = as.Date(started_at), day = wday(date, label = TRUE, abbr = FALSE)) #I was having trouble getting the time out in 12-hour format, another one I had to look up. This extracts the time to look like 09AM, 05PM, 12PM, 11AM. They are stored as chr so ill need to order them later. Cyclistic_data <- Cyclistic_data %>% mutate(time = format(started_at, "%I%p")) #I didn't like the leading 0 infront of say 05AM, this removes all leading 0 in the column. Cyclistic_data$time <- sub("^0", "", Cyclistic_data$time) #Using that date column I seperated earlier, remove the day to create a new column exclusively for month, Cyclistic_data <- Cyclistic_data %>% mutate(month = format(date, "%Y-%m")) Cyclistic_data <- Cyclistic_data %>% filter(month != "2023-08") # 4,146,064 observations after removing those dates starting in 08 #Having a check of all the new edits. head(Cyclistic_data$time) head(Cyclistic_data$day) head(Cyclistic_data$month) summary(Cyclistic_data) str(Cyclistic_data) #All information has been extracted, removing these columns. Cyclistic_data <- Cyclistic_data %>% select(-started_at, -ended_at) #During my time on excel I noticed alot of questionable figures. Creating tables for time lengths below 3 minutes, and above 4 hours. long_duration_data <- Cyclistic_data %>% filter(duration > 240) long_duration_summary <- long_duration_data %>% group_by(member_casual) %>% summarise( count = n(), mean_duration = mean(duration, na.rm = TRUE), max_duration = max(duration, na.rm = TRUE)) summary(long_duration_summary) short_duration_data <- Cyclistic_data %>% filter(duration < 3) short_duration_summary <- short_duration_data %>% group_by(member_casual) %>% summarise(count = n(), mean_duration = mean(duration), max_duration = max(duration)) summary(short_duration_summary) # 3,998 observations over 4hours in length. # 289,273 observations under 3 minutes. #Creating density graphs to deplay these results. #You dont want to know how many copies of this plot I created with different variations, I went with simplicity and bolding the text after all the colour changing, vertical start/end lines & alpha line editing ggplot(long_duration_data, aes(x = duration, fill = member_casual)) + geom_density(alpha = 0.5) + labs(title = "Ride Durations over 4 hours", subtitle = "3,998/4,146,064 observations are over 4 hours in the data set.", x = "Duration (minutes)") + theme_minimal() + theme(axis.title.y = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_blank(), plot.title = element_text(size = 20, face = "bold"), plot.subtitle = element_text(size = 16), panel.grid = element_blank(), legend.text = element_text(size = 14), axis.title.x = element_text(size = 14), axis.text.x = element_text(size = 12)) + scale_x_continuous(breaks = c(240, seq(300, max(long_duration_data$duration), by = 60))) + guides(fill = guide_legend(title = NULL)) ggplot(short_duration_data, aes(x = duration, fill = member_casual)) + geom_density(alpha = 0.5) + labs(title = "Ride Durations under 3 minutes", subtitle = "289,273/4,146,064 observations are under 3 minutes in the data set.", x = "Duration (minutes)") + theme_minimal() + theme(axis.title.y = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_line(), plot.title = element_text(size = 20, face = "bold"), plot.subtitle = element_text(size = 16), legend.text = element_text(size = 14), axis.title.x = element_text(size = 14), axis.text.x = element_text(size = 12), panel.grid = element_blank()) + guides(fill = guide_legend(title = NULL)) #remove trips under 1 minute but I first want to capture this information, creating new table. under_1min_duration <- Cyclistic_data %>% filter(duration < 1) #At this point this is my last filtering, so it goes from data, to review. Allowing me to keep the data file 1 step away from its review form. Cyclistic_review <- Cyclistic_data %>% filter(duration >= 1) #after filtering out 1min durations count when down to 4,091,950. #time to create some visualisations, remember how I said I converted time and day column's to , these below commands set it in order. Cyclistic_review$time <- factor(Cyclistic_review$time, levels = c("1AM", "2AM", "3AM", "4AM", "5AM", "6AM", "7AM", "8AM", "9AM", "10AM", "11AM", "12PM", "1PM", "2PM", "3PM", "4PM", "5PM", "6PM", "7PM", "8PM", "9PM", "10PM", "11PM", "12AM" )) Cyclistic_review$day <- factor(Cyclistic_review$day, levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")) #I want to seperately attempt to analyse this data. Being the location of where trip begin and end including coordinates. station_info <- Cyclistic_review %>% select(start_station_name, end_station_name, start_lat, start_lng, end_lat, end_lng, member_casual) # Create top ten end stations for casual members top_end_stations_casual <- station_info %>% filter(member_casual == "casual") %>% group_by(end_station_name) %>% summarise(count = n()) %>% arrange(desc(count)) %>% slice_head(n = 10) # Create top ten end stations for members top_end_stations_member <- station_info %>% filter(member_casual == "member") %>% group_by(end_station_name) %>% summarise(count = n()) %>% arrange(desc(count)) %>% slice_head(n = 10) # Create top ten start stations for casual members top_start_stations_casual <- station_info %>% filter(member_casual == "casual") %>% group_by(start_station_name) %>% summarise(count = n()) %>% arrange(desc(count)) %>% slice_head(n = 10) # Create top ten start stations for members top_start_stations_member <- station_info %>% filter(member_casual == "member") %>% group_by(start_station_name) %>% summarise(count = n()) %>% arrange(desc(count)) %>% slice_head(n = 10) #Because I have that station table I wont need these, besides I have them still in Cyclistic_data table. Cyclistic_review <- Cyclistic_review %>% select(-start_station_id, -end_station_id, -start_station_name, -end_station_name, -ride_id, -start_lat, -start_lng, -end_lat, -end_lng) #Checking. summary(Cyclistic_data) summary(Cyclistic_review) #Creating tables for averages for duration, avg_duration_full <- Cyclistic_review %>% group_by(member_casual, day, month) %>% summarise(avg_duration = mean(duration, na.rm = TRUE)) #While I have made my fair share of ggplots throughout the course, stat = "identity", position = "dodge" is something I also learnt from Equitable Equations. #Because im working with averages divided between casual and members I didnt want it to be 100pct total together but seperate averages. Thats what the stat identity ensures. #position = dodge ensures they dont appear ontop of each other but side by side. #whether I use these graphs or visualisations from other platforms I dont know ggplot(avg_duration_full, aes(x = month, y = avg_duration, fill = member_casual)) + geom_bar(stat = "identity", position = "dodge") + labs(title = "Average Trip Duration(mins) by Month") + theme_minimal() + theme(axis.title.x = element_blank(), axis.title.y = element_blank(), legend.title = element_blank()) + guides(fill = guide_legend(title = NULL)) ggplot(avg_duration_full, aes(x = day, y = avg_duration, fill = member_casual)) + geom_bar(stat = "identity", position = "dodge") + labs(title = "Average Trip Duration(mins) by Day") + theme_minimal() + theme(axis.title.x = element_blank(), axis.title.y = element_blank(), legend.title = element_blank()) + guides(fill = guide_legend(title = NULL)) Cyclistic_review <- Cyclistic_review %>% filter(month != "2023-08") #created table to place in report, during the course we looked into the use of long and wide format. #I wanted to trial wide format as I've been working with alot of long data, I had to search this command of pivot_wider() as no prior expose. avg_duration_table <- avg_duration_full %>% select(member_casual, month, day, avg_duration) %>% pivot_wider(names_from = month, values_from = avg_duration, names_prefix = "Month_") %>% mutate(across(starts_with("Month_"), ~ paste0(round(.x, 2), " mins"))) #Time to save some files, so both these tables & station_info from earlier. write.csv(station_info, "station_info.csv", row.names = FALSE) write.csv(avg_duration_table, "average_duration_table.csv", row.names = FALSE) write.csv(under_1min_duration, "under_1min_table.csv", row.names = FALSE) write.csv(top_end_stations_casual, "top_end_stations_casual.csv", row.names = FALSE) write.csv(top_end_stations_member, "top_end_stations_member.csv", row.names = FALSE) write.csv(top_start_stations_casual, "top_start_stations_casual.csv", row.names = FALSE) write.csv(top_start_stations_member, "top_start_stations_member.csv", row.names = FALSE) write.csv(long_duration_data, "long_duration_data.csv", row.names = FALSE) write.csv(short_duration_data, "short_duration_data.csv", row.names = FALSE) #Creating table to help create a pie chart for count, not sure how it will turn out. Counter <- Cyclistic_review %>% group_by(member_casual) %>% summarise(count = n()) ggplot(Counter, aes(x = 2, y = count, fill = member_casual)) + geom_bar(stat = "identity", width = 1) + coord_polar(theta = "y") + xlim(0.5, 2.5) + theme_minimal() + theme(axis.title = element_blank(), axis.text.y = element_blank(), axis.text.x = element_blank(), panel.grid = element_blank(), legend.title = element_blank()) + labs(title = "Count of trips: Member vs Casual", subtitle = "Members = 65% of the count\nCasuals = 35% of the count\nTotal count = 4,091,950") + geom_text(aes(label = scales::comma(count)), position = position_stack(vjust = 0.5)) #I want to observe is trips per day by the time of day. #I had to look into how to do this step, within this table I wanted to have the count & percentage per day, the groups = 'drop' helps hold the percentage to the day and not the whole table. group_by_day <- Cyclistic_review %>% group_by(day, time) %>% summarise(count = n(), .groups = 'drop') %>% group_by(day) %>% mutate(percentage = count / sum(count) * 100) monthly_counts <- Cyclistic_review %>% group_by(month, member_casual) %>% summarise(count = n(), .groups = 'drop') bike_type_counts <- Cyclistic_review %>% group_by(rideable_type, member_casual) %>% summarise(count = n(), .groups = 'drop') write.csv(group_by_day, "counts_by_time_by_day.csv", row.names = FALSE) write.csv(monthly_counts, "monthly_counts.csv", row.names = FALSE) write.csv(bike_type_counts, "bike_type_counts.csv", row.names = FALSE) write.csv(Counter, "count_of_trips.csv", row.names = FALSE) write.csv(Cyclistic_data, "Cyclistic_data.csv", row.names = FALSE) write.csv(Cyclistic_review, "Cyclistic_review.csv", row.names = FALSE)


In the code box above is the entirety of my script that I used within R, which showcases examples of me using commands from these packages: tidyverse, geosphere & scales.

I created 2 tables. One for high duration trips of over 4hours & the other for duration for trips under 3 minutes of length.

Here are the results

Trips over 4 hours

alt text

member_casual day time distance duration
casual Sunday 4PM 0.0000000 267.4667
member Friday 5PM 2.2859228 325.0667
casual Friday 11AM 6.1936625 528.8833
casual Friday 6PM 1.3598795 249.5333
casual Sunday 12PM 1.6225216 269.9833
casual Friday 11AM 1.6301813 515.9167
member Thursday 10AM 0.5828608 255.6833
casual Friday 5PM 2.4017389 285.4000
casual Thursday 4AM 4.0835123 490.6333
casual Saturday 4PM 0.9094301 268.3667
member Wednesday 10AM 0.0000000 280.5000
member Friday 9AM 1.1230843 306.9000
member Friday 6AM 3.9930711 435.3833
casual Friday 5PM 2.4017389 283.6167
casual Thursday 8AM 0.0000000 267.4833
casual Friday 6AM 0.0000000 249.1667
casual Saturday 2AM 0.0000000 676.9833
casual Saturday 5PM 1.5309559 255.7833
member Wednesday 5AM 1.0702981 327.8167
member Thursday 4PM 0.4067669 414.1500
casual Saturday 5PM 2.1320309 250.2000
member Thursday 6PM 2.4910387 250.9833
member Saturday 9AM 12.2693843 406.0833
member Sunday 8AM 6.9500147 395.9167
casual Sunday 8AM 0.0000000 295.2500
casual Friday 11AM 0.4441515 272.8167
casual Sunday 12PM 3.1591959 291.0667
casual Saturday 11AM 15.7690985 368.6333
member Saturday 9AM 1.4360204 479.9167
member Thursday 3AM 0.3197353 309.4667
casual Saturday 11AM 0.1516442 273.8333
casual Tuesday 2AM 0.0000000 369.9833
casual Monday 4PM 0.0000000 245.1833
casual Friday 9AM 2.0650160 257.8833
casual Saturday 1PM 2.8782314 241.1333
casual Saturday 6AM 0.0000000 242.1000
casual Saturday 11AM 0.6841452 293.7000
casual Saturday 2AM 1.1627881 607.4167
casual Friday 4PM 0.9984893 318.9000
casual Sunday 8AM 8.2703863 240.4333
member Tuesday 10AM 0.7533263 292.5000
casual Wednesday 8AM 4.7516953 421.6167
member Saturday 1PM 0.6143663 453.4333
casual Saturday 12AM 2.3088144 776.4333
casual Tuesday 12PM 0.0000000 427.8333
casual Sunday 7AM 1.0247790 686.4833
casual Saturday 2PM 0.0000000 272.6333
member Wednesday 6AM 0.7420484 590.9333
casual Sunday 1PM 1.6310436 273.3500
casual Saturday 7AM 0.0000000 270.5333
casual Saturday 11AM 0.0000000 280.6000
casual Saturday 7AM 0.0000000 302.6667
casual Saturday 7AM 2.1935750 539.3167
member Sunday 9AM 0.1527771 348.9500
casual Saturday 11AM 6.2720064 378.7500
casual Thursday 6PM 1.2070448 249.5667
casual Friday 8AM 3.7433061 318.9000
casual Saturday 1AM 2.9389548 589.4167
casual Sunday 2PM 0.8852420 245.2500
casual Sunday 2PM 0.8852420 256.7833
casual Saturday 2AM 0.8975036 996.4667
member Thursday 5AM 7.8023342 256.4667
casual Saturday 11AM 3.7753634 476.0167
member Friday 9AM 3.7736168 299.3167
member Tuesday 8AM 12.9812159 341.4833
casual Monday 9AM 2.4596479 625.9500
casual Tuesday 12AM 0.0000000 570.8000
casual Saturday 11AM 0.6841452 294.8333
casual Thursday 4PM 1.4156344 261.9000
member Sunday 6PM 4.9957001 271.2833
casual Sunday 9AM 1.2094589 357.9167
casual Thursday 11AM 2.9026262 499.5833
member Tuesday 4AM 1.0912617 826.5667
casual Monday 11AM 0.7881408 272.3000
member Friday 10AM 1.5188059 446.4000
casual Friday 6AM 0.0000000 373.4500
member Friday 3PM 9.2963330 335.6667
casual Friday 9AM 1.2888765 418.9833
casual Saturday 12AM 0.4151005 1231.7833
member Thursday 1PM 0.0000000 307.6333
casual Friday 1AM 0.5521328 303.2667
member Thursday 11AM 2.6761359 341.8667
casual Saturday 3PM 1.5046137 308.8667
member Monday 8AM 1.8319554 423.6167
member Sunday 1PM 1.3937289 291.4667
casual Saturday 6PM 0.9404399 246.1500
member Saturday 12PM 6.7687302 259.8000
casual Monday 10AM 0.0000000 349.3500
casual Tuesday 8AM 1.2505911 310.7667
casual Monday 11AM 2.3288199 246.3333
casual Saturday 5PM 2.3532147 312.5000
member Monday 8AM 1.1105557 271.7500
casual Sunday 2PM 1.9878884 377.3000
casual Friday 6PM 5.2870733 240.2500
casual Friday 4PM 2.0296840 403.0333
casual Saturday 11AM 0.0000000 347.9333
casual Friday 8AM 1.0504544 260.6833
casual Friday 8AM 1.0504544 264.8667
member Thursday 11AM 0.4104641 287.0833
casual Friday 1AM 0.0000000 733.8667
casual Wednesday 4PM 0.0000000 242.7667
casual Friday 11AM 0.0000000 258.8167
casual Thursday 10AM 1.4269009 423.0667
casual Saturday 11AM 1.7104408 461.7667
casual Sunday 11AM 1.0161558 473.0667
casual Wednesday 10AM 4.1152733 488.3500
casual Friday 12AM 0.3369670 972.0667
casual Friday 12PM 0.0000000 396.0667
casual Friday 12PM 0.0000000 396.4000
casual Thursday 2PM 0.6085115 243.3833
member Saturday 10AM 0.0000000 257.2333
casual Thursday 1PM 1.6986505 298.7833
casual Wednesday 9AM 6.2952961 527.6333
casual Thursday 8AM 2.2356399 443.4667
casual Monday 1PM 1.4657035 349.0000
casual Friday 12AM 1.2540985 867.5167
casual Saturday 2PM 0.0000000 297.5500
member Monday 6AM 1.6454806 286.4833
casual Saturday 1AM 2.4898893 693.0500
casual Saturday 7AM 0.3323922 301.1667
casual Sunday 1AM 0.0000000 571.3000
member Saturday 12PM 1.6009861 611.8167
casual Monday 10AM 0.5220774 484.7000
casual Saturday 1PM 8.8259985 345.7333
member Thursday 8AM 12.8782159 340.2500
member Tuesday 8AM 0.5638712 426.0500
member Friday 4PM 1.5048510 346.1000
casual Sunday 12PM 2.4477388 439.3167
casual Friday 9AM 0.0000000 241.1500
casual Wednesday 7AM 0.0000000 496.1500
casual Friday 8AM 0.3138223 244.4167
casual Thursday 9AM 2.3694810 263.1667
casual Sunday 12PM 0.7449020 420.4000
casual Sunday 5AM 0.9006750 710.4667
casual Friday 12PM 1.5783301 389.8500
casual Sunday 1PM 1.6206909 380.1667
casual Wednesday 10AM 2.2334020 498.7167
member Sunday 8AM 0.0000000 622.5833
casual Saturday 10AM 1.6942840 290.2167
casual Friday 8AM 0.7746417 324.1167
casual Saturday 7AM 1.7526494 254.6167
casual Saturday 7AM 1.7526494 254.3333
casual Tuesday 3PM 3.3544152 296.8833
casual Saturday 12AM 0.0000000 345.5167
casual Sunday 2PM 5.2877843 240.1833
casual Sunday 2PM 8.9969430 270.2667
casual Wednesday 7AM 0.5867328 429.0667
casual Wednesday 2PM 0.0000000 259.7667
casual Monday 5PM 0.4098025 241.6000
casual Saturday 2AM 0.4112389 280.5333
member Tuesday 5PM 0.9923324 244.8333
member Thursday 8AM 16.1193560 424.2167
member Tuesday 5PM 0.4122580 258.4167
member Tuesday 8AM 1.3748681 466.4667
casual Saturday 5PM 0.0000000 309.6500
casual Saturday 2AM 0.5352304 1006.7833
casual Sunday 12AM 9.5452403 265.9833
casual Friday 7AM 3.2035292 284.1333
casual Thursday 11AM 0.2280966 263.7500
casual Thursday 11AM 0.2280966 468.2667
member Monday 8AM 2.2277700 534.3667
casual Sunday 12PM 8.9292390 403.1333
casual Saturday 11AM 0.3008148 297.0833
casual Saturday 4AM 0.3298823 488.0833
casual Tuesday 10AM 3.2028703 325.8667
casual Wednesday 1PM 3.7338897 354.9000
casual Saturday 6AM 1.3602205 789.2333
casual Sunday 2AM 0.4434909 590.9833
casual Friday 4AM 2.5142259 304.8500
casual Saturday 3AM 1.3345599 537.4000
casual Saturday 12AM 0.9374323 743.3667
casual Monday 12PM 7.2737340 292.4500
casual Thursday 4PM 1.9895150 244.6333
casual Friday 9AM 0.0000000 339.2167
casual Wednesday 3PM 1.9895150 509.6500
casual Friday 12PM 5.6805280 390.3167
casual Saturday 11AM 15.7690985 367.7500
casual Sunday 12PM 1.8780596 410.9500
casual Monday 10AM 10.8390537 424.9333
casual Monday 12PM 0.0000000 241.5833
casual Sunday 11AM 1.8242200 445.0000
member Saturday 12PM 1.3805572 298.5000
casual Sunday 1PM 0.8372766 340.8333
member Friday 11AM 2.0960584 412.7000
member Thursday 12PM 1.4279265 426.0000
casual Friday 5PM 6.2873850 251.9833
casual Tuesday 1PM 0.0000000 599.9667
casual Saturday 4PM 0.0000000 245.0833
casual Thursday 10AM 0.0000000 271.2833
casual Thursday 10AM 0.0000000 270.7500
member Wednesday 9AM 0.7047275 364.9833
casual Friday 11AM 0.0000000 272.8167
casual Friday 3AM 0.9330142 247.2833
casual Saturday 3AM 3.3805878 392.7333
member Saturday 4PM 1.6871995 260.5000
casual Saturday 12AM 0.0000000 806.0833
member Sunday 5AM 4.4027706 370.5333
member Wednesday 9AM 0.0000000 330.4333
member Thursday 12PM 0.0000000 357.3333
casual Wednesday 9AM 1.9123130 262.6167


In the table above is a sample of the Rides over 4 hours in duration, if you scroll you will see majority of the examples where distance covered is not even a kilometer or worse 0. The explanation is simple, riders are returning the bike from where they are picking them up from. This is why I feel the data actually fails in the comprehensiveness part now. As a junior analyst in this position, it would be wise to alert my team of this. Getting their opinion on how I should proceed, given the situation I will continue with my analysis and leave out all results about distance from my analysis but bring it up in my report.

This is somewhat of a rookie mistake on my part I should have thought that through, but on the bright side I was able to notice it during my process phase, investigate further & figure it out.

Trips under 3 minutes

alt text

member_casual day time distance duration
member Thursday 10AM 0.6416725 2.3666667
member Saturday 2PM 0.0000000 0.4833333
member Saturday 2PM 0.0000000 0.0500000
member Sunday 6PM 0.0072693 0.7833333
member Sunday 12PM 0.0084922 1.7333333
casual Tuesday 3PM 0.9837643 2.6500000
member Thursday 4PM 0.6314039 2.4500000
member Tuesday 5PM 0.0000000 1.0333333
member Saturday 8AM 0.0000000 2.7833333
casual Wednesday 1PM 0.3599641 2.3166667
member Sunday 8AM 0.5882149 2.0666667
member Wednesday 8AM 0.5969045 2.0500000
member Wednesday 8AM 0.8836357 2.9666667
member Thursday 9AM 0.4083844 2.0333333
member Saturday 12AM 0.0000000 1.8000000
casual Saturday 10PM 0.0130298 0.7500000
casual Saturday 6PM 0.0000000 0.1500000
member Sunday 2PM 0.0000000 1.0833333
member Thursday 4PM 0.0000000 0.4833333
member Wednesday 4PM 0.2734735 1.4833333
member Monday 7PM 0.6962600 2.6833333
member Friday 3PM 0.0000000 1.4500000
casual Sunday 7PM 0.0000000 0.4166667
casual Thursday 11AM 0.0000000 0.1833333
casual Sunday 11PM 0.0194948 0.3500000
member Tuesday 1AM 0.0150922 0.2166667
member Friday 7AM 0.7618581 2.9833333
casual Friday 11PM 0.0000000 1.3000000
casual Sunday 7PM 0.0018778 1.1333333
casual Friday 4PM 0.0024142 1.2000000
casual Saturday 1AM 0.0000000 1.5833333
member Wednesday 2PM 0.0078976 0.7333333
member Wednesday 10AM 0.0130662 0.9833333
member Wednesday 11PM 0.4251317 1.7333333
member Saturday 1PM 0.0057063 0.1000000
member Sunday 6PM 0.0103789 0.7666667
member Tuesday 1PM 0.0104043 1.0500000
member Tuesday 8PM 0.3794903 2.3666667
member Monday 8AM 0.4466237 2.4833333
member Tuesday 4PM 0.4495019 2.3833333
member Monday 7PM 0.4499959 2.0500000
member Friday 9AM 0.6718145 2.6500000
member Monday 3PM 0.5335743 1.7000000
member Wednesday 3PM 0.5375019 1.8666667
member Wednesday 8AM 0.4251317 1.8833333
casual Saturday 10PM 0.0000000 0.6500000
casual Wednesday 9PM 0.0000000 1.3500000
member Monday 11PM 0.0000000 0.2500000
member Wednesday 4PM 0.0092629 0.5166667
member Saturday 1PM 0.0000000 1.6500000
member Thursday 7AM 0.0106417 1.0166667
member Monday 3PM 0.0224739 0.8166667
casual Saturday 2AM 0.6157032 2.1833333
member Wednesday 7PM 0.2841614 2.0500000
member Monday 7AM 0.0065547 0.7333333
casual Tuesday 8AM 0.4599092 2.4833333
member Wednesday 11AM 0.4599092 2.1666667
member Thursday 11PM 0.0479219 2.2166667
member Friday 11PM 0.0000000 0.1000000
member Saturday 11PM 0.0079514 0.8500000
casual Wednesday 12PM 0.0000000 0.2666667
member Tuesday 4PM 0.0043538 1.9166667
member Monday 8AM 0.0000000 0.8666667
member Monday 3PM 0.0000000 0.8333333
member Saturday 11PM 0.3218281 1.6000000
casual Monday 12PM 0.0067973 1.2333333
casual Monday 1PM 0.0037224 0.4000000
casual Saturday 5PM 0.0415932 0.6666667
casual Saturday 2PM 0.0000000 0.7500000
casual Saturday 2PM 0.0000000 0.5000000
casual Friday 9AM 0.0000000 0.7333333
member Saturday 8PM 0.0127440 1.8166667
member Friday 10PM 0.3197353 1.9500000
member Monday 8PM 0.3197353 1.1833333
casual Friday 5PM 0.3197353 2.0333333
member Sunday 9PM 0.4705321 2.7333333
member Sunday 3PM 0.7752734 2.8833333
member Saturday 1PM 0.6203533 2.5333333
member Friday 5PM 0.3980277 2.4333333
member Monday 3PM 0.0002216 2.4833333
casual Friday 8AM 0.0000000 0.9666667
casual Friday 10PM 0.0000000 0.3666667
member Sunday 10AM 0.0000000 0.3000000
member Monday 12PM 0.2075367 2.1666667
member Monday 7PM 0.5569097 2.7333333
member Friday 12PM 0.7656174 2.3666667
member Monday 7AM 0.7616406 2.9166667
member Tuesday 9AM 0.3805234 2.5833333
member Thursday 9AM 0.3805234 2.4500000
member Friday 12PM 0.0000000 0.9000000
member Friday 8AM 0.0156308 0.6000000
member Sunday 4PM 0.0055419 1.0666667
member Friday 7PM 0.1003547 0.1500000
member Saturday 12PM 0.5810330 2.6833333
member Thursday 2PM 0.4886699 2.9000000
member Tuesday 2PM 0.5552736 2.4000000
member Friday 1PM 0.4825366 2.6000000
member Friday 8PM 0.2280966 2.9000000
member Friday 9AM 0.0000000 0.2666667
member Friday 1AM 0.0000000 0.0666667
member Sunday 3PM 0.0000000 0.5666667
casual Saturday 4PM 0.0000000 0.9333333
casual Saturday 4PM 0.0000000 1.6833333
casual Friday 11PM 0.0117703 0.6333333
casual Friday 6PM 0.0000000 0.4333333
casual Friday 4PM 0.0000000 0.8000000
casual Monday 6PM 0.0000000 0.5333333
casual Sunday 1AM 0.1100531 1.8000000
member Monday 6AM 0.8033813 2.8833333
member Saturday 1PM 0.8430592 2.9166667
member Thursday 8AM 0.4102942 2.1833333
casual Tuesday 2PM 0.3287639 2.0666667
casual Sunday 7AM 0.5378468 2.6333333
member Friday 9AM 0.4910749 2.1833333
member Friday 3PM 0.5638712 2.8666667
member Sunday 11AM 0.4665433 2.7500000
member Saturday 2PM 0.0058531 0.0166667
member Saturday 2PM 0.0052068 0.0166667
member Tuesday 5AM 0.0000000 1.1166667
member Friday 6PM 0.6061398 2.1833333
member Friday 10PM 0.5558077 2.5333333
member Wednesday 3PM 0.0000000 0.1000000
member Friday 2PM 0.0798099 1.6000000
member Sunday 3PM 0.7812582 2.0666667
member Sunday 12AM 0.0104379 0.5000000
member Thursday 4PM 0.0097392 1.0666667
casual Friday 12AM 0.0000000 0.0333333
member Thursday 2PM 0.6733444 2.9500000
member Sunday 2PM 0.6701441 2.7666667
member Friday 7AM 0.4018058 1.8166667
member Sunday 6PM 0.4018058 1.8500000
member Saturday 7PM 0.4018058 2.4166667
member Monday 1PM 0.2939542 1.7166667
member Wednesday 8PM 0.3151344 1.9333333
casual Wednesday 11PM 0.5521328 2.7500000
member Tuesday 5PM 0.5457753 1.8333333
member Tuesday 8AM 0.3597399 2.0833333
member Friday 10PM 0.5277459 2.2000000
member Tuesday 2PM 0.5242002 1.8166667
member Thursday 7AM 0.3433267 1.8166667
member Thursday 5PM 0.5247202 2.4833333
casual Saturday 5PM 0.9577617 2.8333333
member Tuesday 10AM 0.4937251 2.7166667
member Saturday 8AM 0.2146445 1.3166667
member Wednesday 10AM 0.0000000 1.6333333
member Friday 11AM 0.2575454 1.6833333
member Thursday 2PM 0.2918513 2.7166667
member Saturday 4PM 0.4685794 2.4166667
member Wednesday 6PM 0.3967977 2.2500000
member Thursday 3PM 0.8579785 2.9500000
member Friday 7AM 0.4406271 2.1666667
member Tuesday 12PM 0.4410127 2.0666667
member Thursday 8PM 0.4406271 2.7833333
member Thursday 10PM 0.0000000 0.0666667
member Sunday 5PM 0.0000000 1.3500000
member Sunday 4PM 0.0007547 1.1000000
member Monday 1PM 0.3792719 1.6500000
member Wednesday 10AM 0.4892355 2.3000000
member Friday 1PM 0.4892355 2.6666667
member Friday 1PM 0.4892355 1.7500000
member Thursday 6AM 0.4892355 1.9500000
member Friday 2PM 0.5395504 2.8666667
member Sunday 12PM 0.5395504 2.8166667
member Friday 2PM 0.5545163 2.1000000
member Tuesday 9AM 0.5234394 2.9833333
member Sunday 8AM 0.8429966 2.7166667
member Tuesday 9AM 0.0000000 0.0500000
casual Monday 7PM 0.0000000 0.1666667
casual Monday 7PM 0.0000000 0.0500000
member Sunday 1PM 0.0079737 0.3833333
member Sunday 3PM 0.3372566 2.2666667
member Wednesday 9PM 0.0132314 0.2333333
casual Friday 8AM 0.0059751 0.8166667
member Monday 4PM 0.0000000 0.3000000
casual Sunday 6PM 0.0135398 0.7166667
casual Friday 4PM 0.0103112 2.0666667
casual Saturday 11AM 0.0000000 2.0833333
casual Wednesday 4PM 0.0052275 2.2333333
member Friday 2PM 0.3516273 1.8666667
member Tuesday 8PM 0.3766763 1.7833333
member Friday 3PM 0.0000000 1.7833333
member Saturday 5PM 0.0000000 0.9500000
member Wednesday 6PM 0.5629344 2.5000000
member Saturday 4PM 0.5988432 2.5666667
member Monday 6PM 0.3835234 1.2666667
member Friday 5PM 0.0000000 0.3333333
member Saturday 4PM 0.3920508 2.6000000
member Sunday 5PM 0.3920508 2.2500000
member Thursday 7AM 0.5988432 2.3500000
member Tuesday 5PM 0.0000000 0.2333333
member Wednesday 11PM 0.0100569 1.2333333
member Wednesday 5PM 0.4748958 2.7833333
member Monday 4PM 0.3098704 2.5833333
member Wednesday 2AM 0.0000000 0.5500000
member Tuesday 8AM 0.0092253 0.6166667
member Wednesday 9PM 0.0000000 0.4833333
member Friday 1PM 0.4437310 2.3666667
member Saturday 8PM 0.0398147 0.7500000
member Tuesday 8PM 0.0022716 2.0500000
casual Saturday 4PM 0.0000000 1.2333333


At the 0 mark on the X-axis, the count doesn’t even start at 0. Most trips under 1 minute are by casual users, suggesting they might be trialing the service without fully using it. To better analyse differences in rider type, I’ll exclude trips under 1 minute, which total 54,115 values. However, this high level of curiosity will be noted.

It is at this point I switch from Cyclistic_data to Cyclistic Review, by doing so I’m able to keep a file of the combined sheets that are cleaned much like the files I created in the process phase and proceed to Analyse phase with a different file.

What does the data say!

Finally getting that out of the way I was able to build some tables that will help me look at the data better.

Total counts and average trip duration
The combined files were at the beginning 5,699,639 observation. By the time I get to this stage the new count is 4,091,950. Which will be the total amount of rides I will be working with for my analysis.

alt text

-2,650,101 rides were completed by members with a 12 minute average ride duration
-1,441,849 rides were completed by casuals with a 23 minute average ride duration
This COULD suggest that casual riders use the service more for leisure or longer trips, while members might use it more for commuting or shorter, frequent trips.

Top ten starting and ending locations
Both member and casual riders show a ton of activity around the Chicago Loop, which happens to be the second largest business district in the United States as I’ve recently discovered. This area also includes tourist spots, lakefront access points, and residential areas.

For members, the top start locations make up 7.36% of all member rides, and the top end locations account for 7.44% of rides respectively. This really highlights the frequent commuting happening in the business districts and residential areas. High traffic at main transport hubs just adds to this trend, showing how biking is such a crucial part of daily commutes in chicago!

On the flip side, casual riders are all about fun. The top start locations for casual riders make up 14.56% of all rides, and the top end locations account for 14.83% of all rides. Tourist spots, scenic areas, and popular attractions are main highlights of these locations. This could indicate a whole different usage, focusing on leisure and recreational activities.
In addition to this, while creating these tables i found out the distinct number of stations is now 1,646 from the original 692! Despite being only 0.61% of the total 1,646 stations, the top 10 locations account for a large portion of the rides.

alt text alt text alt text alt text

What bike options are riders using more
Classic Bikes: 67% of total rides
-Casual: 35%
-Member: 65%
Electric Bikes: 33% of total rides
Casual: 36%
Member: 64%

Between the two options, Classic bikes are used more with very little difference between the member groups.

alt text

Ride density during the week
Firstly I think its important to separate these two graphs so we can see the difference between the weekday period of Monday - Friday & then Saturday/Sunday

alt text alt text
Out of a total of 4,091,950 trips, 2,886,597 occurred during the week, while 1,205,353 took place over the weekend. Note that the graph sizes should not be interpreted as proportional representations of the overall count; they solely reflect the density of the individual counts in the separated graphs.
From the density we can see a more corporate based usage during the week, and recreational weekend usage. Further eluding that a lot of the riders might be using the bike for work purposes.

What does a yearly review look like?
Comparing members to casuals the peak season for counts is generally the same split but a HUGE drop coming into the winter. Casual riders experienced a chilling 86% drop in rides from October 2023 (124,703 rides) to January 2024 (16,948 rides).
Member riders also saw a decrease, though less harsh, with a 64% reduction from September 2023 (265,546 rides) to January 2024 (93,769 rides).
Below shows density of both member & casual rides throughout the year. alt text

Something a little bit extra
As someone who part takes in a community based sport, I know for a fact that when other people are involved commitment is alot easier. So after getting all these graph I did a quick google of the community based events for cycling in Chicago and found a couple of annual events, I picked two in September of 2023 to look at bike usage averages for the specific time period compared to the rest of the month.
2023 Let’s Ride, Illnois! - With 9 out of the 17 rides that take place in Chicago during the period 8th - 17th of September.
Apple Cider Century An event on the last Sunday of September 24-09-2023.

alt text alt text


With Let’s Ride, Illnois! the 8th-17th equating to around 33% of the month, the percentage of the trips for that part of the month actually falls short.
With Apple Cider Century it wasn’t the lowest Sunday during the month, but usage doesn’t suggest a spike either.
While it is not definite it doesn’t seem like bike riding events in Chicago actually impact the usage of Cyclistic bikes.

Share

Github
Tableau dashboard

Inside the github link will have my script on R as a file, the Power point presentation & the tableau dashboard. A dashboard by itself seemed odd for the project so I created it more as a visual drill down tool stakeholders would be able to look into while I share the main overview in the presentation.
Additionally I have the script of how I would run the presentation which I have very little experience with.
I also added it to the notes of the actual PowerPoint to make it more real life scenario but by no means is this in the running for new upload to TEDx Talks.

Act

This is the part where I make company suggestions which is difficult for me, the issue I see with my suggestions will be the scope. As a junior analyst I would be asking my colleague’s if a suggestion is good or not, if they have already been trialed already or is the company at that stage yet when it comes to outreach. So please take these suggestions as more of the creative sparks that I have.

Moving forward

  1. Partnerships/Sponsorship with local clubs/events: A big part of what makes leisure or sport activities engaging is the community aspect. I’m no scholar but humans are social beings, and when they feel more involved in an activity, they are more likely to commit long term ie- annual memberships. Regular participation strengthens their sense of belonging and community ties, making them more invested in the activity.
    Additionally giving annual discount off a small percentage to people apart of these clubs/events.
  2. App adoption - Much like fitness tracking/order tracking/financial tracking on apps, introduce some of these similar tracking processes for the customer base. The ability to connect with the customer as pointed out above will make them more inclined.
  3. Distance tracking - while you can consider this apart of suggestion two, this needs to be addressed separately. One of the first things I think of is displaying the amount of distance covered by all riders on the company website! But more importantly it will provide better user insights for the team on where casual customers are going - providing advertisement locations for future promotions.

Even if you don’t see immediate response, if you see an increase in your casual rides it might indicate casual members are using the product more. Consider a time you’ve been at the grocery store and you buy the same single item product you buy every week, when right next to it you have the value pack and you think to yourself “I really should just buy the big pack”

Adieu à jamais

While this is the end of the capstone, it doesn’t mean you wont see more of me!
I’ve learnt alot throughout this course & it only makes me excited to learn & share more.
Key skills I’ve learnt from this course include:
-Understanding of stakeholder needs, using evaluation questions to get data driven results
- Best practices when handling information: like proper documentation, storing of files, importance of changelog’s, data cleaning & aggregation.
-Using spreadsheets more effectively and integrating them with other platforms
-Built solid foundation knowledge of SQL, I will definitely need more exposure to it going forward.
-Fundamental skills on Tableau for visualisation. Building informative visualisations is not easy, much like SQL I will definitely have to work on it.
-Fallen in love with R programming, looking at blocks of code with a blank face to now reading it like its a difficult book but gradually picking things up. I’ve effectively completed majority of this capestone on R including this write up itself being an Rmarkdown report.

Thank you for taking the time to look over my first analysis, given the opportunity I would love feedback as I’m very new to this all.
Til next time

alt text